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Abstract. This purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with 
emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modem Markov chain 
Monte Carlo simulation, thereby providing and introduction to the remaining papers of this special issue. Lastly, 
it discusses new interesting research horizons. 
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1. Introduction 

A recent survey places the Metropolis algorithm among the ten algorithms that have had the 
greatest influence on the development and practice of science and engineering in the 20th 
century (Beichl & Sullivan, 2000). This algorithm is an instance of a large class of sampling 
algorithms, known as Markov chain Monte Carlo (MCMC). These algorithms have played 
a significant role in statistics, econometrics, physics and computing science over the last 
two decades. There are several high-dimensional problems, such as computing the volume 
of a convex body in d dimensions, for which MCMC simulation is the only known general 
approach for providing a solution within a reasonable time (polynomial in d) (Dyer, Frieze, 
& Kannan, 1991; Jerrum & Sinclair, 1996). 

While convalescing from an illness in 1946, Stan Ulam was playing solitaire. It, then, 
occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards 
would come out successfully (Eckhard, 1987). After attempting exhaustive combinatorial 
calculations, he decided to go for the more practical approach of laying out several solitaires 
at random and then observing and counting the number of successful plays. This idea of 
selecting a statistical sample to approximate a hard combinatorial problem by a much 
simpler problem is at the heart of modern Monte Carlo simulation. 
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Stan Ulam soon realised that computers could be used in this fashion to answer ques- 
tions of neutron diffusion and mathematical physics. He contacted John Von Neumann, 
who understood the great potential of this idea. Over the next few years, Ulam and Von 
Neumann developed many Monte Carlo algorithms, including importance sampling and 
rejection sampling. Enrico Fermi in the 1930’s also used Monte Carlo in the calculation of 
neutron diffusion, and later designed the FERMIAC, a Monte Carlo mechanical device that 
performed calculations (Anderson, 1986). In the 1940’s Nick Metropolis, a young physicist, 
designed new controls for the state-of-the-art computer (ENIAC) with Klari Von Neumann, 
John’s wife. He was fascinated with Monte Carlo methods and this new computing device. 
Soon he designed an improved computer, which he named the MANIAC in the hope that 
computer scientists would stop using acronyms. During the time he spent working on the 
computing machines, many mathematicians and physicists (Fermi, Von Neumann, Ulam, 
Teller, Richtmyer, Bethe, Feynman, & Gamow) would go to him with their work problems. 
Eventually in 1949, he published the first public document on Monte Carlo simulation with 
Stan Ulam (Metropolis & Ulam, 1949). This paper introduces, among other ideas, Monte 
Carlo particle methods, which form the basis of modern sequential Monte Carlo methods 
such as bootstrap filters, condensation, and survival of the fittest algorithms (Doucet, de 
Freitas, & Gordon, 2001). Soon after, he proposed the Metropolis algorithm with the Tellers 
and the Rosenbluths (Metropolis et al., 1953). 

Many papers on Monte Carlo simulation appeared in the physics literature after 1953. 
From an inference perspective, the most significant contribution was the generalisation of 
the Metropolis algorithm by Hastings in 1970. Hastings and his student Peskun showed that 
Metropolis and the more general Metropolis-Hastings algorithms are particular instances 
of a large family of algorithms, which also includes the Boltzmann algorithm (Hastings, 
1970; Peskun, 1973). They studied the optimality of these algorithms and introduced the 
formulation of the Metropolis-Hastings algorithm that we adopt in this paper. In the 1980’s, 
two important MCMC papers appeared in the fields of computer vision and artificial in- 
telligence (Geman & Geman, 1984; Pearl, 1987). Despite the existence of a few MCMC 
publications in the statistics literature at this time, it is generally accepted that it was only in 
1990 that MCMC made the first significant impact in statistics (Gelfand & Smith, 1990). In 
the neural networks literature, the publication of Neal (1996) was particularly influential. 

In the introduction to this special issue, we focus on describing algorithms that we feel 
are the main building blocks in modem MCMC programs. We should emphasize that in 
order to obtain the best results out of this class of algorithms, it is important that we do not 
treat them as black boxes, but instead try to incorporate as much domain specific knowledge 
as possible into their design. MCMC algorithms typically require the design of proposal 
mechanisms to generate candidate hypotheses. Many existing machine learning algorithms 
can be adapted to become proposal mechanisms (de Freitas et al., 2001). This is often 
essential to obtain MCMC algorithms that converge quickly. In addition to this, we believe 
that the machine learning community can contribute significantly to the solution of many 
open problems in the MCMC field. For this purpose, we have outlined several “hot” research 
directions at the end of this paper. Finally, readers are encouraged to consult the excellent 
texts of Chen, Shao, and Ibrahim (2001), Gilks, Richardson, and Spiegelhalter (1996), Liu 
(2001), Meyn and Tweedie (1993), Robert and Casella (1999) and review papers by Besag 
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et al. (1995), Brooks (1998), Diaconis and Saloff-Coste (1998), Jerrum and Sinclair (1996), 
Neal (1993), and Tierney (1994) for more information on MCMC. 

The remainder of this paper is organised as follows. In Part 2, we outline the general 
problems and introduce simple Monte Carlo simulation, rejection sampling and importance 
sampling. Part 3 deals with the introduction of MCMC and the presentation of the most 
popular MCMC algorithms. In Part 4, we describe some important research frontiers. To 
make the paper more accessible, we make no notational distinction between distributions 
and densities until the section on reversible jump MCMC. 

2. MCMC motivation 


MCMC techniques are often applied to solve integration and optimisation problems in 
large dimensional spaces. These two types of problem play a fundamental role in machine 
learning, physics, statistics, econometrics and decision analysis. The following are just some 
examples. 


1. Bayesian inference and learning. Given some unknown variables x e X and data y e y, 
the following typically intractable integration problems are central to Bayesian statistics 


(a) Normalisation. To obtain the posterior p(x \ y ) given the prior plx) and likelihood 
ply | x), the normalising factor in Bayes’ theorem needs to be computed 
p{y | x)plx) 


plx | y) = 


f x P(y I x')plx')dx r 


(b) Marginalisation. Given the joint posterior of (x, z) £ X x Z, we may often be 
interested in the marginal posterior 

plx I y ) - j z P( x ’ z i y^ dz - 

(c) Expectation. The objective of the analysis is often to obtain summary statistics of 
the form 

^■p(x\y)lflx)) = f fix) plx | y)dx 
Jx 

for some function of interest f : X —> M"/ integrable with respect to plx \ y). 
Examples of appropriate functions include the conditional mean, in which case 
fix ) = x, orthe conditional covariance of x where fix) = xxf — E p (A . } (x ) E' ; | }) ( (x ) . 


2. Statistical mechanics. Here, one needs to compute the partition function Z of a system 
with states s and Hamiltonian E(s) 


z =X>p|"-frl 


where k is the Boltzmann’s constant and T denotes the temperature of the system. 
Summing over the large number of possible configurations is prohibitively expensive 
(Baxter, 1982). Note that the problems of computing the partition function and the 
normalising constant in statistical inference are analogous. 
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3. Optimisation. The goal of optimisation is to extract the solution that minimises some 
objective function from a large set of feasible solutions. In fact, this set can be contin- 
uous and unbounded. In general, it is too computationally expensive to compare all the 
solutions to find out which one is optimal. 

4. Penalised likelihood model selection. This task typically involves two steps. First, one 
finds the maximum likelihood (ML) estimates for each model separately. Then one uses 
a penalisation term (for example MDL, BIC or AIC) to select one of the models. The 
problem with this approach is that the initial set of models can be very large. Moreover, 
many of those models are of not interest and, therefore, computing resources are wasted. 

Although we have emphasized integration and optimisation, MCMC also plays a funda- 
mental role in the simulation of physical systems. This is of great relevance in nuclear 
physics and computer graphics (Chenney & Forsyth, 2000; Kalos & Whitlock, 1986; Veach 
& Guibas, 1997). 

2.1. The Monte Carlo principle 

The idea of Monte Carlo simulation is to draw an i.i.d. set of samples {x ( ' > }|L 1 from a target 
density p(x) defined on a high-dimensional space X (e.g. the set of possible configurations 
of a system, the space on which the posterior is defined, or the combinatorial set of feasible 
solutions). These N samples can be used to approximate the target density with the following 
empirical point-mass function 

1 N 

Pn(x) = 

™ ;= l 

where S x w(x) denotes the delta-Dirac mass located at x <l> . Consequently, one can approx- 
imate the integrals (or very large sums) 1(f) with tractable sums l,\(f ) that converge as 
follows 


1 N r 

h(f) = — X) /(* (,) ) 1(f) = J* f(x)p(x)dx. 

That is, the estimate /#(/) is unbiased and by the strong law of large numbers, it will 
almost surely (a.s.) converge to 1(f). If the variance (in the univariate case for simplicity) 
of f(x) satisfies a 2 = E p ( x )(f 2 (x)) — I 2 (f) < oo, then the variance of the estimator 
Lv(/) is equal to var(/, v (/)) = A and a central limit theorem yields convergence in 
distribution of the error 

VN(I N (f) - 1(f)) ^ AT(0, a 2 ), 

where => denotes convergence in distribution (Robert & Casella, 1999; Section 3.2). 
The advantage of Monte Carlo integration over deterministic integration arises from the 
fact that the former positions the integration grid (samples) in regions of high probability. 
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The N samples can also be used to obtain a maximum of the objective function p(x) as 
follows 

x = argmax p(x (,) ) 

xW;i=l,...,N 

However, we will show later that it is possible to construct simulated annealing algorithms 
that allow us to sample approximately from a distribution whose support is the set of global 
maxima. 

When p(x) has standard form, e.g. Gaussian, it is straightforward to sample from it using 
easily available routines. However, when this is not the case, we need to introduce more 
sophisticated techniques based on rejection sampling, importance sampling and MCMC. 


2.2. Rejection sampling 

We can sample from a distribution p(x), which is known up to a proportionality constant, 
by sampling from another easy-to- sample proposal distribution q{x) that satisfies p(x) < 
Mq(x), M < oo, using the accept/reject procedure describe in figure 1 (see also figure 2). 
The accepted x <,) can be easily shown to be sampled with probability p(x) (Robert & 


Set i = 1 

Repeat until i = N 

1. Sample x^~q(x) and«~W( 0 ,i)- 

2. If u < then accept a:**) and increment the counter i by 

1. Otherwise, reject. 


Figure 1. Rejection sampling algorithm. Here, u ~ U( o,i> denotes the operation of sampling a uniform random 
variable on the interval (0, 1). 



Figure 2. Rejection sampling: Sample a candidate jr (l ' and a uniform variable u. Accept the candidate sample if 
uMq(x ('*) < p(x (l) ), otherwise reject it. 
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Casella, 1999, p. 49). This simple method suffers from severe limitations. It is not always 
possible to bound p(x)/q(x) with a reasonable constant M over the whole space X. If M 
is too large, the acceptance probability 

PtCy accepted) = Pt(m < = 

will be too small. This makes the method impractical in high-dimensional scenarios. 


2.3. Importance sampling 


Importance sampling is an alternative “classical” solution that goes back to the 1940’s; 
see for example (Geweke, 1989; Rubinstein, 1981). Let us introduce, again, an arbitrary 
importance proposal distribution q(x) such that its support includes the support of p(x). 
Then we can rewrite 1(f) as follows 


un-S 


f(x)w(x)q(x)dx 


where w(x) = is known as the importance weight. Consequently, if one can simulate 
N i.i.d. samples according to q(x) and evaluate w(x (,) ), a possible Monte Carlo 

estimate of 1(f) is 


W) = E/(x ( '>(x (i) ) 


This estimator is unbiased and, under weak assumptions, the strong law of large num- 
bers applies, that is ! n(J) — > 1(f)- It is clear that this integration method can also be 
interpreted as a sampling method where the posterior density p(x) is approximated by 


Pn(x) = w(x (l) ) S x m(x) 


and /at(/) is nothing but the function fix) integrated with respect to the empirical measure 
p N (x). 

Some proposal distributions q(x) will obviously be preferable to others. An important 
criterion for choosing an optimal proposal distribution is to find one that minimises the 
variance of the estimator /#(/). The variance of f(x)w(x) with respect to q(x) is given by 

var q ( x )(f(x)w(xJ) = E qM (f 2 (x)w 2 (x)) - I 2 (f) (8) 

The second term on the right hand side does not depend on q(x) and hence we only need 
to minimise the first term, which according to Jensen’s inequality has the following lower 
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bound 


E q(x) (f 2 (x)w 2 (x)) > (E ?w (|/(x)|u;(x))) 2 


(/'• 


This lower bound is attained when we adopt the following optimal importance distribution 


q\x) = 


/ \f{x)\p{x)dx 


The optimal proposal is not very useful in the sense that it is not easy to sample from 
\f(x)\p(x). However, it tells us that high sampling efficiency is achieved when we focus 
on sampling from p(x) in the important regions where | f(x)\p(x) is relatively large; hence 
the name importance sampling. 

This result implies that importance sampling estimates can be super-efficient. That is, 
for a a given function fix), it is possible to find a distribution qix) that yields an estimate 
with a lower variance than when using a perfect Monte Carlo method, i.e. with q{x) = pix). 
This property is often exploited to evaluate the probability of rare events in communication 
networks (Smith, Shaft, & Gao, 1997). There the quantity of interest is a tail probability 
(bit error rate) and hence fix) = I/.;(.r) where l E {x) = 1 if x e E and 0 otherwise (see 
figure 3). One could estimate the bit error rate more efficiently by sampling according to 
q{x) ex I K (x)pix) than according to qix) = pix). That is, it is wasteful to propose candidates 
in regions of no utility. In many applications, the aim is usually different in the sense that 




Figure 3. Importance sampling: one should place more importance on sampling from the state space regions that 
matter. In this particular example one is interested in computing a tail probability of error (detecting infrequent 
abnormalities). 
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one wants to have a good approximation of p(x ) and not of a particular integral with respect 
to p(x), so we often seek to have q(x) — p(x). 

As the dimension of the x increases, it becomes harder to obtain a suitable q(x) from 
which to draw samples. A sensible strategy is to adopt a parameterised q(x, 9) and to 
adapt 9 during the simulation. Adaptive importance sampling appears to have originated 
in the structural safety literature (Bucher, 1988), and has been extensively applied in the 
communications literature (Al-Qaq, Devetsikiotis, & Townsend, 1995; Remondo et al., 
2000). This technique has also been exploited recently in the machine learning community 
(de Freitas et al., 2000; Cheng & Druzdzel, 2000; Ortiz & Kaelbling, 2000; Schuurmans & 
Southey, 2000). A popular adaptive strategy involves computing the derivative of the first 
term on the right hand side of Eq. (8) 

D{9) = E q( X ,e)(^f 2 (x)w(x, ^ j 


and then updating the parameters as follows 




dw(x (,) , 9,) 
d9 t 


where a is a learning rate and x <l) ~ q(x. 9). Other optimisation approaches that make use 
of the Hessian are also possible. 

When the normalising constant of p(x) is unknown, it is still possible to apply the 
importance sampling method by rewriting 1(f) as follows: 

f f(x)w(x)q(x)dx 
1 f w(x)q(x) dx 


where w(x) oc is now only known up to a normalising constant. The Monte Carlo 
estimate of 1(f) becomes 


h(f) 




= ff(^)w(^) 


where w(x ll > ) is a normalised importance weight. For N finite, l,\f f ) is biased (ratio of two 
estimates) but asymptotically, under weak assumptions, the strong law of large numbers 
applies, that is /#(/) 1(f). Under additional assumptions a central limit theorem can 

be obtained (Geweke, 1989). The estimator /#(/) has been shown to perform better than 
/#(/) in some setups under squared error loss (Robert & Casella, 1999). 

If one is interested in obtaining M i.i.d. samples from Pn(x), then an asymptotically 
(N/M — »■ oo ) valid method consists of resampling M times according to the discrete distri- 
bution Pn(x). This procedure results in M samples x U) with the possibility that x u> = x < ’ ) 



INTRODUCTION 


13 


for i ^ j . This method is known as sampling importance resampling (SIR) (Rubin, 1988). 
After resampling, the approximation of the target density is 

l M 

Pitix) = ~ T2 Sx<n(x') (13) 

M “ 

The resampling scheme introduces some additional Monte Carlo variation. It is, therefore, 
not clear whether the SIR procedure can lead to practical gains in general. However, in 
the sequential Monte Carlo setting described in Section 4.3, it is essential to carry out this 
resampling step. 

We conclude this section by stating that even with adaptation, it is often impossible to 
obtain proposal distributions that are easy to sample from and good approximations at the 
same time. For this reason, we need to introduce more sophisticated sampling algorithms 
based on Markov chains. 

3. MCMC algorithms 

MCMC is a strategy for generating samples x (l> while exploring the state space X using a 
Markov chain mechanism. This mechanism is constructed so that the chain spends more 
time in the most important regions. In particular, it is constructed so that the samples x in 
mimic samples drawn from the target distribution p(x). (We reiterate that we use MCMC 
when we cannot draw samples from p(x) directly, but can evaluate p(x) up to a normalising 
constant.) 

It is intuitive to introduce Markov chains on finite state spaces, where x (,) can only take 
s discrete values x (l> e X = [x\, X2 , . . . , x s } . The stochastic process x <,) is called a Markov 
chain if 

P (xV\x«- 1 \...,xV) = T(xV\x«-»), 


The chainis homogeneous if T = T(x {l) | remains invariant for all z, with T(x <l> \ 

x (,_1) ) = 1 for any i. That is, the evolution of the chain in a space X depends solely on the 
current state of the chain and a fixed transition matrix. 

As an example, consider a Markov chain with three states (s = 3) and a transition graph 
as illustrated in figure 4. The transition matrix for this example is 


' 0 1 O' 

0 0.1 0.9 

0.6 0.4 0 


If the probability vector for the initial state is /x(x rl) ) = (0.5, 0.2, 0.3), it follows that 
p(x m )T = (0.2, 0.6, 0.2) and, after several iterations (multiplications by T), the product 
p(x m )T‘ converges to p(x) = (0.2, 0.4, 0.4). No matter what initial distribution p(x m ) 
we use, the chain will stabilise at p(x) = (0.2, 0.4, 0.4). This stability result plays a funda- 
mental role in MCMC simulation. For any starting point, the chain will convergence to the 
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0.1 



0.6 


Figure 4. Transition graph for the Markov chain example with X = [xi , xi, *3). 

invariant distribution p(x), as long as T is a stochastic transition matrix that obeys the 
following properties: 

1 . Irreducibility. For any state of the Markov chain, there is a positive probability of visiting 
all other states. That is, the matrix T cannot be reduced to separate smaller matrices, 
which is also the same as stating that the transition graph is connected. 

2. Aperiodicity. The chain should not get trapped in cycles. 

A sufficient, but not necessary, condition to ensure that a particular p{x) is the desired 
invariant distribution is the following reversibility (detailed balance) condition 

p(x (i) )T(x (i ~ l) | x (i) ) = p(x«-V)T(x® | #“ 1) ). 

Summing both sides over gives us 

P( x<i) ) = J^ p(x (i ~ l) )T(x m | ac (, '- 1) ). 

MCMC samplers are irreducible and aperiodic Markov chains that have the target distribu- 
tion as the invariant distribution. One way to design these samplers is to ensure that detailed 
balance is satisfied. However, it is also important to design samplers that converge quickly. 
Indeed, most of our efforts will be devoted to increasing the convergence speed. 

Spectral theory gives us useful insights into the problem. Notice that p(x) is the left 
eigenvector of the matrix T with corresponding eigenvalue 1. In fact, the Perron-Frobenius 
theorem from linear algebra tells us that the remaining eigenvalues have absolute value less 
than 1. The second largest eigenvalue, therefore, determines the rate of convergence of the 
chain, and should be as small as possible. 

The concepts of irreducibility, aperiodicity and invariance can be better appreciated once 
we realise the important role that they play in our lives. When we search for information on 
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the World-Wide Web, we typically follow a set of links (Berners-Lee et al., 1994). We can 
interpret the webpages and links, respectively, as the nodes and directed connections in a 
Markov chain transition graph. Clearly, we (say, the random walkers on the Web) want to 
avoid getting trapped in cycles (aperiodicity) and want to be able to access all the existing 
webpages (irreducibility). Let us consider, now, the popular information retrieval algorithm 
used by the search engine Google, namely PageRank (Page et al„ 1998). PageRank requires 
the definition of a transition matrix with two components T — L + E.L is a large link matrix 
with rows and columns corresponding to web pages, such that the entry L t j represents the 
normalised number of links from web page i to web page j . £ is a uniform random matrix 
of small magnitude that is added to L to ensure irreducibility and aperiodicity. That is, the 
addition of noise prevents us from getting trapped in loops, as it ensures that there is always 
some probability of jumping to anywhere on the Web. From our previous discussion, we 
have 


p(x (i+1) )[L+E] = p(x‘) 

where, in this case, the invariant distribution (eigenvector) p(x) represents the rank of a 
webpage x. Note that it is possible to design more interesting transition matrices in this 
setting. As long as one satisfies irreducibility and aperiodicity, one can incorporate terms 
into the transition matrix that favour particular webpages or that bias the search in useful 
ways. 

In continuous state spaces, the transition matrix T becomes an integral kernel K and 
p{x) becomes the corresponding eigenfunction 

J p(x (,) )K(x (,+l) | x (0 ) dx (i) = p(x (i+1) ). 

The kernel K is the conditional density of x (l+l) given the value x (,} . It is a mathematical 
representation of a Markov chain algorithm. In the following subsections we describe 
various of these algorithms. 

3.1. The Metropolis-Hastings algorithm 

The Metropolis-Hastings (MH) algorithm is the most popular MCMC method (Hastings, 
1970; Metropolis et al., 1953). In later sections, we will see that most practical MCMC 
algorithms can be interpreted as special cases or extensions of this algorithm. 

An MH step of invariant distribution p{x) and proposal distribution q(x* \ x) involves 
sampling a candidate value x* given the current value x according to q(x* \ x). The Markov 
chain then moves towards x* with acceptance probability A(x, x*) = min{l, [p(x)q(x* | 
x)]~ 1 p(x*)q(x | x*)}, otherwise it remains at x. The pseudo-code is shown in figure 5, while 
figure 6 shows the results of running the MH algorithm with a Gaussian proposal distribution 
q(x* | x <0 ) = jV'(x (0 , 100) and a bimodal target distribution p(x) oc 0.3 exp(— 0.2x 2 ) + 
0.7 exp(— 0.2(x — 10) 2 ) for 5000 iterations. As expected, the histogram of the samples 
approximates the target distribution. 
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The MH algorithm is very simple, but it requires careful design of the proposal distri- 
bution q{x* | x ). In subsequent sections, we will see that many MCMC algorithms arise by 
considering specific choices of this distribution. In general, it is possible to use suboptimal 
inference and learning algorithms to generate data-driven proposal distributions. 

The transition kernel for the MH algorithm is 

I * (i) ) = «(* (i+1) I * (i+1) ) + M* (i+1) M* (0 )> 
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where r(x u> ) is the term associated with rejection 

It is fairly easy to prove that the samples generated by MH algorithm will mimic samples 
drawn from the target distribution asymptotically. By construction, satisfies the detailed 

balance condition 


p(x (,) )A' mh (x (!+ 1) | x (!) ) = | x (!+1) ) 


and, consequently, the MH algorithm admits p(x) as invariant distribution. To show that 
the MH algorithm converges, we need to ensure that there are no cycles (aperiodicity) 
and that every state that has positive probability can be reached in a finite number of steps 
(irreducibility). Since the algorithm always allows for rejection, it follows that it is aperiodic. 
To ensure irreducibility, we simply need to make sure that the support of q(-) includes the 
support of p{-). Under these conditions, we obtain asymptotic convergence (Tierney, 1994, 
Theorem 3, p. 1717). If the space X is small (for example, bounded in R"), then it is 
possible to use minorisation conditions to prove uniform (geometric) ergodicity (Meyn & 
Tweedie, 1993). It is also possible to prove geometric ergodicity using Foster-Lyapunov 
drift conditions (Meyn & Tweedie, 1993; Roberts & Tweedie, 1996). 

The independent sampler and the Metropolis algorithm are two simple instances of the 
MH algorithm. In the independent sampler the proposal is independent of the current state, 
q(x* | x u> ) = q(x*). Hence, the acceptance probability is 




= min 


P(x*)q(x^) 

p(x^)q{x*) 


= min 


1 w(x*) j 


This algorithm is close to importance sampling, but now the samples are correlated since 
they result from comparing one sample to the other. The Metropolis algorithm assumes a 
symmetric random walk proposal q(x* \ x in ) = q(x u> \ x*) and, hence, the acceptance ratio 
simplifies to 




= min 


p(x*) 1 

'pW i 


Some properties of the MH algorithm are worth highlighting. Firstly, the normalising 
constant of the target distribution is not required. We only need to know the target distribution 
up to a constant of proportionality. Secondly, although the pseudo-code makes use of a single 
chain, it is easy to simulate several independent chains in parallel. Lastly, the success or 
failure of the algorithm often hinges on the choice of proposal distribution. This is illustrated 
in figure 7. Different choices of the proposal standard deviation a* lead to very different 
results. If the proposal is too narrow, only one mode of p(x) might be visited. On the other 
hand, if it is too wide, the rejection rate can be very high, resulting in high correlations. If all 
the modes are visited while the acceptance probability is high, the chain is said to “mix” well. 
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3.2. Simulated annealing for global optimization 

Let us assume that instead of wanting to approximate p(x), we want to find its global 
maximum. For example, if p(x) is the likelihood or posterior distribution, we often want 
to compute the ML and maximum a posteriori (MAP) estimates. As mentioned earlier, we 
could run a Markov chain of invariant distribution p(x) and estimate the global mode by 

x = argmax p(x (!) ). 

This method is inefficient because the random samples only rarely come from the vicinity 
of the mode. Unless the distribution has large probability mass around the mode, computing 
resources will be wasted exploring areas of no interest. A more principled strategy is to 
adopt simulated annealing (Geman & Geman, 1984; Kirkpatrick, Gelatt, & Vecchi, 1983; 
Van Laarhoven & Arts, 1987). This technique involves simulating a non-homogeneous 
Markov chain whose invariant distribution at iteration i is no longer equal to p(x), but to 


pfx) cx p' /T '(x), 
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1. Initialise x( 0 ) and set T 0 = 1. 

2. For i = 0 to N — 1 

- Sample u ~ W[ 0 , !]• 


Sample x* ~ g(x*|xW). 



— Set T ; + 1 according to a chosen cooling schedule. 


Figure 8. General simulated annealing algorithm. 



Figure 9. Discovering the modes of the target distribution with the simulated annealing algorithm. 

where T t is a decreasing cooling schedule with 7j = 0. The reason for doing this 

is that, under weak regularity assumptions on p(x), p°°(x) is a probability density that 
concentrates itself on the set of global maxima of p(x). The simulated annealing involves, 
therefore, just a minor modification of standard MCMC algorithms as shown in figure 8 . The 
results of applying annealing to the example of the previous section are shown in figure 9. 

To obtain efficient annealed algorithms, it is again important to choose suitable proposal 
distributions and an appropriate cooling schedule. Many of the negative simulated annealing 
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results reported in the literature often stem from poor proposal distribution design. In some 
complex variable and model selection scenarios arising in machine learning, one can even 
propose from complex reversible jump MCMC kernels (Section 3.7) within the annealing 
algorithm (Andrieu, de Freitas, & Doucet, 2000a). If one defines a joint distribution over 
the parameter and model spaces, this technique can be used to search for the best model 
(according to MDL or AIC criteria) and ML parameter estimates simultaneously. 

Most convergence results for simulated annealing typically state that if for a given 7) , 
the homogeneous Markov transition kernel mixes quickly enough, then convergence to the 
set of global maxima of p{x) is ensured for a sequence 7) = (C ln(i + 7o)) -1 , where C and 
T 0 are problem-dependent. Most of the results have been obtained for finite spaces (Geman 
& Geman, 1984; Van Laarhoven & Arts, 1987) or compact continuous spaces (Haario & 
Sacksman, 1991). Some results for non-compact spaces can be found in Andrieu, Breyer, 
and Doucet (1999). 

3.3. Mixtures and cycles of MCMC kernels 

A very powerful property of MCMC is that it is possible to combine several samplers into 
mixtures and cycles of the individual samplers (Tierney, 1994). If the transition kernels K \ 
and K2 have invariant distribution p(-) each, then the cycle hybrid kernel K \ Ko and the 
mixture hybrid kernel vK\ + (1 — v)K2, for 0 < v < 1, are also transition kernels with 
invariant distribution p(-). 

Mixtures of kernels can incorporate global proposals to explore vast regions of the 
state space and local proposals to discover finer details of the target distribution (Andrieu, 
de Freitas, & Doucet, 2000b; Andrieu & Doucet, 1999; Robert & Casella, 1999). This will 
be useful, for example, when the target distribution has many narrow peaks. Here, a global 
proposal locks into the peaks while a local proposal allows one to explore the space around 
each peak. For example, if we require a high-precision frequency detector, one can use 
the fast Fourier transform (FFT) as a global proposal and a random walk as local proposal 
(Andrieu & Doucet, 1 999). Similarly, in kernel regression and classification, one might want 
to have a global proposal that places the bases (kernels) at the locations of the input data and 
a local random walk proposal that perturbs these in order to obtain better fits (Andrieu, de 
Freitas, & Doucet, 2000b). However, mixtures of kernels also play a big role in many other 
samplers, including the reversible jump MCMC algorithm (Section 3.7). The pseudo-code 
for a typical mixture of kernels is shown in figure 10. 

Cycles allow us to split a multivariate state vector into components (blocks) that can be 
updated separately. Typically the samplers will mix more quickly by blocking highly cor- 
related variables. A block MCMC sampler, using bj to indicate the y-th block, to denote 
the number of blocks and x^'j = {x ( ^ l> . x^ +1) , . . . , x% x^ | , . . . , }, is shown in 

figure 11. The transition kernel for this algorithm is given by the following expression 

7fMH-Cycle(^ (,+1) | X (l) ) = f] ^MH© {xff^ | X$J , 


where denotes the y'-th MH algorithm in the cycle. 
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1. Initialise i®. 

2. For i = 0 to N - 1 

- Sample u ~ W[ 0 ,i]- 

- Ifuct- 

Apply the MH algorithm with a global proposal. 

Apply the MH algorithm with a random walk proposal. 


Figure 10. Typical mixture of MCMC kernels. 


1. Initialise x<°>. 

2. For i = 0 to IV — 1 

- Sample the block ccj' +1 * according to an MH step with pro- 
posal distribution <7i(x£* +1 *|x^j*j,a4'*) and invariant distribution 

— Sample the block x^ +1 * according to an MH step with pro- 
posal distribution and invariant distribution 

P(41 +1 V-£])- 


— Sample the block according to an MH step with proposal 
distribution and invariant distribution 


Figure 11. Cycle of MCMC kernels — block MH algorithm. 

Obviously, choosing the size of the blocks poses some trade-offs. If one samples the 
components of a multi-dimensional vector one-at-a-time, the chain may take a very long 
time to explore the target distribution. This problem gets worse as the correlation between 
the components increases. Alternatively, if one samples all the components together, then 
the probability of accepting this large move tends to be very low. 

A popular cycle of MH kernels, known as Gibbs sampling (Geman & Geman, 1984), is 
obtained when we adopt the full conditional distributions plxj \ x_ ; ) = p(xj \ xi , . . . , Xj- 1, 
Xj + 1 as proposal distributions (for notational simplicity, we have replaced the index 

notation bj with j). The following section describes it in more detail. 

3.4. The Gibbs sampler 

Suppose we have an « -dimensional vector x and the expressions for the full conditionals 
p(Xj | *i, ..., Xj-\,Xj + i , . . . , x„). In this case, it is often advantageous to use the following 
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proposal distribution for j = \ ..... n 

lo Otherwise. 

The corresponding acceptance probability is: 
p(x*)q(xV \x*) \ 


*4(x (0 ,x*) = min | 1, 


= min{ 1, 


P(x«) q (x * |,«>) | 
p{x*)p(xf | X^j) | 
p(x^)p(x*\x*_j) j 

’h4)} =1 ' 


That is, the acceptance probability for each proposal is one and, hence, the deterministic 
scan Gibbs sampler algorithm is often presented as shown in figure 12. 

Since the Gibbs sampler can be viewed as a special case of the MH algorithm, it is 
possible to introduce MH steps into the Gibbs sampler. That is, when the full conditionals 
are available and belong to the family of standard distributions (Gamma, Gaussian, etc.), 
we will draw the new samples directly. Otherwise, we can draw samples with MH steps 
embedded within the Gibbs algorithm. For n = 2, the Gibbs sampler is also known as the 
data augmentation algorithm, which is closely related to the expectation maximisation (EM) 
algorithm (Dempster, Laird, & Rubin, 1977; Tanner & Wong, 1987). 

Directed acyclic graphs (DAGS) are one of the best known application areas for Gibbs 
sampling (Pearl, 1987). Here, a large-dimensional joint distribution is factored into a directed 
graph that encodes the conditional independencies in the model. In particular, if x pa (j) 


1. Initialise xo.im- 



2. For i = 0 to N — 1 



- Sample xj i+1) 

~p(xi|x£\x! 

x<?). 

- Sample 

~p(x 2 |xj ,+l) , 

xW-.-.x^). 

- Sample x*‘ +1 * 

~p(x,|x< i+1 \ 

_<i+l> _(<) _(*)', 

> x j+i ’•■■’X» )■ 

- Sample x« +1) 

-P(x„|xr> 



Figure 12. Gibbs sampler. 




INTRODUCTION 


23 


denotes the parent nodes of node Xj, we have 

pm = n i>( x j i x puu))- 


It follows that the full conditionals simplify as follows 

p( x J I ;) = P{ x j I x paU)) II P( Xk I x p°m) 

kech(j ) 

where ch(j) denotes the children nodes of xj. That is, we only need to take into account 
the parents, the children and the children’s parents. This set of variables is known as the 
Markov blanket of Xj. This technique forms the basis of the popular software package for 
Bayesian updating with Gibbs sampling (BUGS) (Gilks, Thomas, & Spiegelhalter, 1994). 
Sampling from the full conditionals, with the Gibbs sampler, lends itself naturally to the 
construction of general purpose MCMC software. It is sometimes convenient to block some 
of the variables to improve mixing (Jensen, Kong, & Kjaerulff, 1995; Wilkinson & Yeung, 
2002 ). 

3.5. Monte Carlo EM 

The EM algorithm (Baum et al., 1970; Dempster, Laird, & Rubin, 1977) is a standard 
algorithm for ML and MAP point estimation. If X contains visible and hidden variables 
x = {x v , Xh}, then a local maximum of the likelihood p(x v | 0) given the parameters 9 can 
be found by iterating the following two steps: 

1 . E step. Compute the expected value of the complete log-likelihood function with respect 
to the distribution of the hidden variables 

Q{9) = £ log (p(x h ,x v | 9))p(x h | 0 (old) ) dx H , 

where 0 UM> refers to the value of the parameters at the previous time step. 

2. M step. Perform the following maximisation 9 (new) = argmax 0 Q(9). 

In many practical situations, the expectation in the E step is either a sum with an exponen- 
tially large number of summands or an intractable integral (Ghahramani, 1995; Ghahramani 
& Jordan, 1995; McCulloch, 1994; Pasula et al., 1999; Utsugi, 2001); see also Dellaert et al. 
(this issue). A solution is to introduce MCMC to sample from p(xh \ x v , 0 <M} ) and replace 
the expectation in the E step with a small sum over the samples, as shown in figure 13. 
Convergence of this algorithm is discussed in Sherman, Ho, and Dalai (1999), while Levine 
and Casella (2001) is a good recent review. 

To improve the convergence behaviour of EM, namely to escape low local minima and 
saddle points, various authors have proposed stochastic approaches that rely on sampling 
from p{Xh | x v , 0 ((M> ) in the E step and then performing the M step using these samples. 
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1. Initialise (xj^\ff^) and set i = 0. 

2. Iteration % of EM 

— Sample {x^ }^Lj with any suitable MCMC algorithm. For exam- 
ple, one could use an MH algorithm with acceptance probability 

l ’ p{x v |4° , 6 l4 J> ) > 

— E step: Compute 

m = ^.f^ogp(x ( j\x v \e) 

— M step: Maximise flW = arg maxQ(0). 

3. i <- i + 1 and go to 2. 


Figure 13. MCMC-EM algorithm. 

The method is known as stochastic EM (SEM) when we draw only one sample (Celeux 
& Diebolt, 1985) and Monte Carlo EM (MCEM) when several samples are drawn (Wei 
& Tanner, 1990). There are several annealed variants (such as SAEM) that become more 
deterministic as the number of iterations increases (Celeux & Diebolt, 1992). The are also 
very efficient algorithms for marginal MAP estimation (SAME) (Doucet, Godsill, & Robert, 
2000). One wishes sometimes that Metropolis had succeeded in stopping the proliferation 
of acronyms! 

3.6. Auxiliary variable samplers 

It is often easier to sample from an augmented distribution p(x. u), where u is an auxiliary 
variable, than from p{x). Then, it is possible to obtain marginal samples x in by sampling 
(x (l> , u (l > ) according to p(x, u) and, subsequently, ignoring the samples u u> . This very useful 
idea was proposed in the physics literature (Swendsen & Wang, 1987). Here, we will focus 
on two well-known examples of auxiliary variable methods, namely hybrid Monte Carlo 
and slice sampling. 

3.6.1. Hybrid Monte Carlo. Hybrid Monte Carlo (HMC) is an MCMC algorithm that 
incorporates information about the gradient of the target distribution to improve mixing 
in high dimensions. We describe here the “leapfrog” HMC algorithm outlined in Duane 
et al. (1987) and Neal (1996) focusing on the algorithmic details and not on the statistical 
mechanics motivation. Assume that p{x) is differentiable and everywhere strictly positive. 
At each iteration of the HMC algorithm, one takes a predetermined number ( L ) of deter- 
ministic steps using information about the gradient of p(x). To explain this in more detail, 
we first need to introduce a set of auxiliary “momentum” variables u e and define the 
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1. Initialise x<°>. 


2. For i = 0 to IV 

-1 

— Sample v 

>~£/ [0il) andu*~W(0,/„J. 

- Let x 0 = 

xb) and ito = u* + pA(x 0 )/2. 

- For l = 1 

, . . . , L, take steps 


X t = Xl—i + pui-1 


ui=ut-i + piA(xi) 

where pi 

= p for l < L and pi = p/2. 

- \fv<A 

= min I 1 ’ ex P (-|(«W - «* T u*)) } 


( I (i+i) )U (i+ 1 )) = ( XLi u L ) 

else 



(xb+b ,„«+!)) = (*«),„*) 


Figure 14. Hybrid Monte Carlo. 

extended target density 

p(x,u) = p(x)Af(u-0,I llx ). 

Next, we need to introduce the n x -dimensional gradient vector A(x) = 3 log p(x)/dx and 
a fixed step-size parameter p > 0. 

In the HMC algorithm, we draw a new sample according to p(x, u) by starting with 
the previous value of x and generating a Gaussian random variable u. We then take L 
“frog leaps” in u and x. The values of u and x at the last leap are the proposal candidates 
in the MH algorithm with target density p(x, u). Marginal samples from p(x) are ob- 
tained by simply ignoring u. Given the algorithm proceeds as illustrated in 

figure 14. 

When only one deterministic step is used, i.e. L — 1 , one obtains the Langevin algorithm, 
which is a discrete time approximation of a Langevin diffusion process. The Langevin 
algorithm is a special case of MH where the candidate satisfies 

x* = xq + puq = x (l " + p(«* + pA(x^' ^)/2) 

with u* ~ Af (0, I„ x ). 

The choice of the parameters L and p poses simulation tradeoffs. Large values of p 
result in low acceptance rates, while small values require many leapfrog steps (expensive 
computations of the gradient) to move between two nearby states. Choosing L is equally 
problematic as we want it to be large to generate candidates far from the initial state, but 
this can result in many expensive computations. HMC, therefore, requires careful tuning of 
the proposal distribution. It is more efficient, in practice, to allow a different step size p for 
each of the coordinates of x (Ishwaran, 1999). 
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3.6.2. The slice sampler. The slice sampler (Damien, Wakefield, & Walker, 1999; Higdon, 
1998; Wakefield, Gelfand, & Smith, 1991) is a general version of the Gibbs sampler. The 
basic idea of the slice sampler is to introduce an auxiliary variable u e K. and construct an 
extended target distribution p*(x, u ), such that 

* _ {' if 0 < w < p(x) 

^ {o otherwise. 

It is then straightforward to check that 

J p*{x,u)du = J du = p(x). 

Hence, to sample from p(x) one can sample from p*(x, u) and then ignore u. The full 
conditionals are of this augmented model are 

p(u | x) = W [0 ,pU)](m) 
p{x | u) = U A (x) 

where A = {x\ p(x) > u). If A is easy to identify then the algorithm is straightforward to 
implement, as shown in figure 15. 

It can be difficult to identify A. It is then worth introducing several auxiliary variables 
(Damien, Wakefield, & Walker, 1999; Higdon, 1998). For example assume that 

L 

p(x) (X ]"~[ /;(*), 

where the // (-)’s are positive functions, not necessarily densities. Let us introduce L auxiliary 
variables {u\ , . . . , ul) and define 

L 

p*(x,ui,...,u L )cx n»» ,Mx)](.Ui). 



Figure 15. Slice sampling: given a previous sample, we sample a uniform variable between 0 and f(x^). 
One then samples in the interval where f(x) > 
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1. For J = 1,...,L 

- Sample ~ W[ 0 ,/,(*<i-»)](«i). 

2. Sample xi$ ~ U A u>(x) where = | x\fi(x) > , l = 1, . . . ,i|. 

Figure 16. Slice sampler. 

Then one can also check that f p*(x, u \ , ur)du \ . . . du^ = p(x) as 

/ P*Oc,uu...,u L )du 1 ...du L cc J du\ . . ,du L = ]~ [ fi(x). 

The slice sampler to sample from p*(x, u\, , ul) proceeds as shown in figure 16. Al- 
gorithmic improvements and convergence results are presented in Mira (1999) and Neal 
(2000). 

3. 7. Reversible jump MCM C 

In this section, we attack the more complex problem of model selection. Typical exam- 
ples include estimating the number of neurons in a neural network (Andrieu, de Freitas, 
& Doucet, 2001a; Holmes & Mallick, 1998; Rios Insua & Muller, 1998), the number of 
splines in a multivariate adaptive splines regression (MARS) model (Holmes & Denison, 
this issue), the number of sinusoids in a noisy signal (Andrieu & Doucet, 1999), the number 
of lags in an autoregressive process (Troughton & Godsill, 1998), the number of com- 
ponents in a mixture (Richardson & Green, 1997), the number of levels in a change- 
point process (Green, 1995), the number of components in a mixture of factor analy- 
sers (Fokoue & Titterington, this issue), the appropriate structure of a graphical model 
(Friedman & Roller, 2001; Giudici & Castelo, this issue) or the best set of input variables 
(Lee, this issue). 

Given a family of M models {M . m ; m = 1 , . . . , N], we will focus on constructing ergodic 
Markov chains admitting p{m, x m ) as the invariant distribution. For simplicity, we avoid 
the treatment of nonparametric model averaging techniques; see for example (Escobar & 
West, 1995; Green & Richardson, 2000). 

Up to this section, we have been comparing densities in the acceptance ratio. However, 
if we are carrying out model selection, then comparing the densities of objects in different 
dimensions has no meaning. It is like trying to compare spheres with circles. Instead, we 
have to be more formal and compare distributions P(dx) = Pr(* e dx) under a common 
measure of volume. The distribution P(dx) will be assumed to admit a density p(x) with 
respect to a measure of interest, e.g. Lebesgue in the continuous case: P(dx) = p(xjdx. 
The acceptance ratio will now include the ratio of the densities and the ratio of the measures 
(Radon Nikodym derivative). The latter gives rise to a Jacobian term. To compare densities 
point- wise, we need, therefore, to map the two models to a common dimension as illustrated 
in figure 17. 




28 


C. ANDRIEU ET AL. 


P( x i>X ; 


:) 


Univariate density 

PfxO & Ptx^x*) 



Proposex* 

uniformly 



Compare both 

densities 

point-wise 



Figure 1 7. To compare a ID model against a 2D model, we first have to map the first model so that both models 


The parameters x m e X m (e.g. X m = M"™) are model dependent. Hence, to find the right 
model and parameters we could sample over the model indicator and the product space 
n"=i (Carlin & Chib, 1995). Recently, Green introduced a strategy that avoids this 
expensive search over the full product space (Green, 1995). In particular one samples on a 
much smaller union space X = Um=i ( m ) x The full target distribution defined in this 
space is given by 


p(k , dx)= ^ p(m, dx m )l [m]xXm (k, x). 


That is, the probability of k being equal to m and x belonging to an infinitesimal set centred 
around x m is p(m , dx m ). By marginalisation, we obtain the probability of being in subspace 
X m . 

Green’s method allows the sampler to jump between the different subspaces. To ensure 
a common measure, it requires the extension of each pair of communicating spaces, X m 
and X n , to X m n = X m x U mn and X n m = X n x U, hm . It also requires the definition of a 
deterministic, differentiable, invertible dimension matching function f„^ m between X m n 
and X n ^ m , 

(Xm, u m , n ) = f n ^ m (x n , U nM ) = {f^ m (x n , U, hm ), f^ m (x n , U n , m )). 
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We define such that f m -+ n (f n -+ m (x n , u nm )) = (x n , u nm ). The choice of the extended 
spaces, deterministic transformation f m ^.„ and proposal distributions for q n ^ m (- 1 n, x n ) and 
q m ^ n (- 1 tn, x m ) is problem dependent and needs to be addressed on a case by case basis. 

If the current state of the chain is ( n , x n ), we move to (m, x m ) by generating u nm ~ 
1 n, x n ), ensuring that we have reversibility (x m , u m n ) = f n ^ m (x n , u n m ), and ac- 
cepting the move according to the probability ratio 


A n ^ m 


P(m,x *) q{n \ m ) g m ^„(u m , n I »»,**) 

P(n,x„) q(m\n) q n ^ m (u n ,m I n, x n ) f " 


where x* m = f£_+ m (x n , u n m ) and J is the Jacobian of the transformation f„^ m (when 
only continuous variables are involved in the transformation) 


Jf m ^ n = det- 


d(x m , u M: „) 


To illustrate this, assume that we are concerned with sampling the locations /i and number 
k of components of a mixture. For example we might want to estimate the locations and 
number of basis functions in kernel regression and classification, the number of mixture 
components in a finite mixture model, or the location and number of segments in a segmen- 
tation problem. Here, we could define a merge move that combines two nearby components 
and a split move that breaks a component into two nearby ones. The merge move involves 
randomly selecting a component and then combining it with its closest neighbour (/x 2 ) 
into a single component n, whose new location is 


Ml + P-2 

* = — 2 ~ 

The corresponding split move that guarantees reversibility, involves splitting a randomly 
chosen component as follows 


P\ — P~ Un.mP 


P2 = P + U„ ym P 


where p is a simulation parameter and, for example, u n m ~ ZT[o,i]- Note that to ensure 
reversibility, we only perform the merge move if ||/X| — /x 2 II < 2 p. The acceptance ratio 
for the split move is 


A sp ii, 


p(k+ 1, n k+ 1) 

P(k, n k ) 


P(u n , m ) 


Jsplit 


where 1 / k denotes the probability of choosing, uniformly at random, one of the k compo- 
nents. The Jacobian is 


_ d(m,p 2 ) _ 1 1 

Jsplit ~ d(p,u n , m ) ~ -p p 


= 2 P. 
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1. Initialisation: set 

2. For i = 0 to N - 1 

- Sample u 

- If (u < bk) 

• then "birth” move. 

• else if (u<bk+dk) then “death" move. 

• else if (« < i>* + djt + s k ) then "split" move. 

• else if (« <bk+dk+Sh + m k ) then "merge” move. 

• else update. 

End If. 

— Sample other parameters. 


Figure 1 8. Generic reversible jump MCMC. 


Similarly, for the merge move, we have 


. [ p(k - 

, e — mini 1, x — ; 


where J merg e = 1/2/3. 

Reversible jump is a mixture of MCMC kernels (moves). In addition, to the split and 
merge moves, we could have other moves such as birth of a component, death of a component 
and a simple update of the locations. The various moves are carried out according to the 
mixture probabilities (bk, dk, m^, Sk, Uk), as shown in figure 18. In fact, it is the flexibility 
of including so many possible moves that can make reversible jump a more powerful 
model selection strategy than schemes based on model selection using a mixture indicator 
or diffusion processes using only birth and death moves (Stephens, 1997). However, the 
problem with reversible jump MCMC is that engineering reversible moves is a very tricky, 
time-consuming task. 


4. The MCMC frontiers 

4. 1. Convergence and perfect sampling 

Determining the length of the Markov chain is a difficult task. In practice, one often dis- 
cards an initial set of samples ( bum-in ) to avoid starting biases. In addition, one can ap- 
ply several graphical and statistical tests to assess, roughly, if the chain has stabilised 
(Robert & Casella, 1999, ch. 8). In general, none of these tests provide entirely satisfactory 
diagnostics. 

Several theoreticians have tried to bound the mixing time ; that is, the minimum number 
of steps required for the distribution of the Markov chain K to be close to the target p(x). 
(Here, we present a, by no means exhaustive, summary of some of the available results.) If 
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we measure closeness with the total variation norm A x (t), where 
A x (t> = i|r n (- | x) - J (K«\y | x) - p(y)) dy, 


then the mixing time is 

r x (e) = min{t : A x (t') < e for all t' > t}. 

If the state space X is finite and reversibility holds true, then the transition operator 
K (K f(x) = J2 K (y I x)f(y)) is self adjoint on L 2 (p). That is, 

(Kf \g) = {f\ Kg), 

where / and g are real functions and we have used the bra-ket notation for the inner product 
(f \ g) = Y1 /ix)g(x)ptx). This implies that K has real eigenvalues 


and an orthonormal basis of real eigenfunctions ft, such that K f) = Xj f) . This spectral 
decomposition and the Cauchy-Schwartz inequality allow us to obtain a bound on the total 
variation norm 


A x (t) < 


2 


where A.* = max(A. 2 , |k|^| |) (Diaconis & Saloff-Coste, 1998; Jerrum & Sinclair, 1996). This 
classical result give us a geometric convergence rate in terms of eigenvalues. Geometric 
bounds have also been obtained in general state spaces using the tools of regeneration and 
Lyapunov-Foster conditions (Meyn & Tweedie, 1993). 

The next logical step is to bound the second eigenvalue. There are several inequalities 
(Cheeger, Poincare, Nash) from differential geometry that allows us to obtain these bounds 
(Diaconis & Saloff-Coste, 1998). For example, one could use Cheeger’s inequality to obtain 
the following bound 


where <t> is the conductance of the Markov chain 


Y,xcS,yc-S' P(-*)K(y I -0 

mm — 

0<p(S)<l/2;Sor p(S ) 


Intuitively, one can interpret this quantity as the readiness of the chain to escape from any 
small region S of the state space and, hence, make rapid progress towards equilibrium 
(Jerrum & Sinclair, 1996). 
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These mathematical tools have been applied to show that simple MCMC algorithms 
(mostly Metropolis) run in time that is polynomial in the dimension d of the state space, 
thereby escaping the exponential curse of dimensionality. Polynomial time sampling algo- 
rithms have been obtained in the following important scenarios: 

1 . Computing the volume of a convex body in d dimensions, where d is large (Dyer, Frieze, 
& Kannan, 1991). 

2. Sampling from log-concave distributions (Applegate & Kannan, 1991). 

3. Sampling from truncated multivariate Gaussians (Kannan & Li, 1996). 

4. Computing the permanent of a matrix (Jerrum, Sinclair, & Vigoda, 2000). 

The last problem is equivalent to sampling matchings from a bipartite graph; a problem 
that manifests itself in many ways in machine learning (e.g., stereo matching and data 
association). 

Although the theoretical results are still far from the practice of MCMC, they will even- 
tually provide better guidelines on how to design and choose algorithms. Already, some 
results tell us, for example, that it is not wise to use the independent Metropolis sampler in 
high dimensions (Mengersen & Tweedie, 1996). 

A remarkable recent breakthrough was the development of algorithms for perfect sam- 
pling. These algorithms are guaranteed to give us an independent sample from p(x ) under 
certain restrictions. The two major players are coupling from the past (Propp & Wilson, 
1998) and Fill’s algorithm (Fill, 1998). From a practical point of view, these algorithms are 
still limited and, in many cases, computationally inefficient. However, some steps are being 
taken towards obtaining more general perfect samplers; for example perfect slice samplers 
(Casella et al., 1999). 


4.2. Adaptive MCMC 

If we look at the chain on the top right of figure 7, we notice that the chain stays at each state 
for a long time. This tells us that we should reduce the variance of the proposal distribution. 
Ideally, one would like to automate this process of choosing the proposal distribution as much 
as possible. That is, one should use the information in the samples to update the parameters of 
the proposal distribution so as to obtain a distribution that is either closer to the target distri- 
bution, that ensures a suitable acceptance rate, or that minimises the variance of the estimator 
of interest. However, one should not allow adaptation to take place infinitely often in a naive 
way because this can disturb the stationary distribution. This problem arises because by using 
the past information infinitely often, we violate the Markov property of the transition kernel. 
That is, p(x (l) \ x m> . x 0> , . . . , no longer simplifies to p(x U) | x (,-1) ). In particular, 

Gelfand and Sahu (1994) present a pathological example, where the stationary distribution is 
disturbed despite the fact that each participating kernel has the same stationary distribution. 

To avoid this problem, one could carry out adaptation only during an initial fixed number 
of steps, and then use standard MCMC simulation to ensure convergence to the right distribu- 
tion. Two methods for doing this are presented in Gelfand and Sahu (1994). The first is based 
on the idea of running several chains in parallel and using sampling-importance resampling 
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(Rubin, 1988) to multiply the kernels that are doing well and suppress the others. In this 
approach, one uses an approximation to the marginal density of the chain as proposal. The 
second method simply involves monitoring the transition kernel and changing one of its com- 
ponents (for example the proposal distribution) so as to improve mixing. A similar method 
that guarantees a particular acceptance rate is discussed in Browne and Draper (2000). 

There are, however, a few adaptive MCMC methods that allow one to perform adaptation 
continuously without disturbing the Markov property, including delayed rejection (Tierney 
& Mira, 1999), parallel chains (Gilks & Roberts, 1996) and regeneration (Gilks, Roberts, & 
Sahu, 1998; Mykland, Tierney, & Yu, 1995). These methods are, unfortunately, inefficient 
in many ways and much more research is required in this exciting area. 

4.3. Sequential Monte Carlo and particle filters 

Sequential Monte Carlo (SMC) methods allow us to carry out on-line approximation of 
probability distributions using samples (particles). They are very useful in scenarios involv- 
ing real-time signal processing, where data arrival is inherently sequential. Furthermore, 
one might wish to adopt a sequential processing strategy to deal with non-stationarity in 
signals, so that information from the recent past is given greater weighting than information 
from the distant past. Computational simplicity in the form of not having to store all the 
data might also constitute an additional motivating factor for these methods. 

In the SMC setting, we assume that we have an initial distribution, a dynamic model and 
measurement model 


P(x o) 

P(x, \x 0:t -i, Jia-i) for f > 1 
p(yt \ X 0:t , JlU-l) for f > 1 

We denote by xo :t — {*o, • • • , x t ] and y\- t = {yi , . . . , »), respectively, the states and the ob- 
servations up to time t . Note that we could assume Markov transitions and conditional inde- 
pendence to simplify the model; p(x 1 1 xo-j-i, yi-.t-i) = p(x t \ x,_i ) and p(y, \ xq-j, y \ :t -\ ) = 
p(y t | x t ). However, this assumption is not necessary in the SMC framework. 

Our aim is to estimate recursively in time the posterior p(x o :t \ y\ - t ) and its associated 
features including the marginal distribution p(x t \ y\ :t ), known as the filtering distribution, 
and the expectations 

Kf,) = Ut (xo-.t)] 

A generic SMC algorithm is depicted in figure 19. Given N particles {jcqJ t } -JLj at 
time t — 1, approximately distributed according to the distribution p(x o :f _i | y\ :t -i), SMC 
methods allow us to compute N particles {xq\}^ =1 approximately distributed according to 
the posterior p(x o :r |yi :f ), at time t. Since we cannot sample from the posterior directly, 
the SMC update is accomplished by introducing an appropriate importance proposal dis- 
tribution q(xo-.t) from which we can obtain samples. The samples are then appropriately 
weighted. 
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Figur 

which provides an approximation of p(x f _i | yi-t-i)- For each particle we compute the importance weights using 
the information at time I — 1. This results in the weighted measure {xJ‘2j, which yields an approximation 

p(\t- 1 | Subsequently, the resampling step selects only the “fittest” particles to obtain the unweighted 

measure {xj'^j , A -1 ), which is still an approximation of p(x,_i | Finally, the sampling (prediction) step 

introduces variety, resulting in the measure {xj‘\ A -1 ), which is an approximation of p(x t | yi : ,_i). 


Sequential importance sampling step 

- For i = 1, 

A, sample from the transition priors 


2|* ) ~9t (xik&f-i>3h*) 

and set 

^=CM-0 

— For i - 1, 

A, evaluate and normalize the importance weights 


w (i) oc p P (“t'VoS-l.SW-l) 



Selection step 


- Multiply/Discard particles with high/low importance 

weights u>l' > to obtain A particles j 


Figure 20. Simple SMC algorithm at time t. For filtering purposes, there is no need for storing or resampling 
the past trajectories. 
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In generic SMC simulation, one needs to extend the current paths {xq\_ x to obtain 
new paths {JCq.^}^Y = 1 using the proposal distribution q{xQj\y X:t ) given by 

<?(*0:r I Tl:f)} = J q(XQ:t I *0:f-l> )h:f)P(*0:t-l \ dx 0:t -l . 

To make this integral tractable, we only propose to modify the particles at time t , and leave 
the past trajectories intact. Consequently 

q(x 0:r I yi:t) = p(*0:t-l I > ! l:/ I )</(,?/ I *0:»-D >'h/) 

The samples from #(•), must be weighted by the importance weights 

= P(x 0:t | yi:r) = p(XQ :t - 1 | yu,) p(x, I XQ:,-uyu,) 

' q(x 0: , | yut) p(x 0:t -i I yi-t-i) q(x, \ x 0:f _i, yi :t ) 

a P(y> I x,)p(x, I xq-j-i, yi;t-i) (22) 

I x Q :t _\ , >’!:() 

From Eq. (22), we note that the optimal importance distribution is 
q{x, I Xo-J 1. >'!:/) = p(x f | Xq-j-I, >’!:()• 

(When using this proposal, one might still encounter difficulties if the ratio of the first two 
terms of Eq. (22) differs significantly from 1 (Andrieu, Doucet, & Punskaya, 2001; Pitt & 
Shephard, 1999).) The optimal importance distribution can be difficult to evaluate. One can 
adopt, instead, the transition prior as proposal distribution 

q(x t | xo : t—i , y\\t) = p{x,\ x 0:t ~i, yi-.t-i) 

in which case the importance weights are given by the likelihood function 

w, oc p ( y t | x t ) . 

This simplified version of SMC has appeared under many names, including condensation 
(Isard & Blake, 1996), survival of the fittest (Kanazawa, Roller, & Russell, 1995) and the 
bootstrap filter (Gordon, Salmond, & Smith, 1993). The importance sampling framework 
allows us to design more principled and “clever” proposal distributions. For instance, one can 
adopt suboptimal filters and other approximation methods that make use of the information 
available at time t to generate the proposal distribution (Doucet, Godsill, & Andrieu, 2000; de 
Freitas et al., 2000; Pitt & Shephard, 1999; van der Merwe et al., 2000). In fact, in some 
restricted situations, one may interpret the likelihood as a distribution in terms of the states 
and sample from it directly. In doing so, the importance weights become equal to the 
transition prior (Fox et al., 2001). 

After the importance sampling step, a selection scheme associates to each particle x^.j 
a number of “children”, say /V, e N, such that = N. This selection step is what 
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allows us to track moving target distributions efficiently by choosing the fittest particles. 
There are various selection schemes in the literature, but their performance varies in terms 
of var[Ni] (Doucet, de Freitas, & Gordon, 2001). 

An important feature of the selection routine is that its interface only depends on particle 
indices and weights. That is, it can be treated as a black-box routine that does not require 
any knowledge of what a particle represents (e.g., variables, parameters, models). This 
enables one to implement variable and model selection schemes straightforwardly. The 
simplicity of the coding of complex models is, indeed, one of the major advantages of these 
algorithms. 

It is also possible to introduce MCMC steps of invariant distribution p(x $ :t \ y\ :t ) on each 
particle (Andrieu, de Freitas, & Doucet, 1999; Gilks & Berzuini, 1998; MacEachern, Clyde, 
& Liu, 1999). The basic idea is that if the particles are distributed according to the poste- 
rior distribution p{x o :l | y\ :t ). then applying a Markov chain transition kernel K (x* yt \ xo :t ), 
with invariant distribution p(- \ y\-,) such that / K (x* yt \ xo :t )p(xo :t \ y\- t ) = p(x* yt \ still 
results in a set of particles distributed according to the posterior of interest. However, the 
new particles might have been moved to more interesting areas of the state-space. In fact, 
by applying a Markov transition kernel, the total variation of the current distribution with 
respect to the invariant distribution can only decrease. Note that we can incorporate any 
of the standard MCMC methods, such as the Gibbs sampler, MH algorithm and reversible 
jump MCMC, into the filtering framework, but we no longer require the kernel to be 
ergodic. 


4.4. The machine learning frontier 

The machine learning frontier is characterised by large dimensional models, massive datasets 
and many and varied applications. Massive datasets pose no problem in the SMC context. 
However, in batch MCMC simulation it is often not possible to load the entire dataset 
into memory. A few solutions based on importance sampling have been proposed recently 
(Ridgeway, 1999), but there is still great room for innovation in this area. 

Despite the auspicious polynomial bounds on the mixing time, it is an arduous task 
to design efficient samplers in high dimensions. The combination of sampling algorithms 
with either gradient optimisation or exact methods has proved to be very useful. Gradient 
optimisation is inherent to Langevin algorithms and hybrid Monte Carlo. These algorithms 
have been shown to work with large dimensional models such as neural networks (Neal, 
1996) and Gaussian processes (Barber & Williams, 1997). Information about derivatives of 
the target distribution also forms an integral part of many adaptive schemes, as discussed 
in Section 2.3. Recently, it has been argued that the combination of MCMC and variational 
optimisation techniques can also lead to more efficient sampling (de Freitas et al., 2001). 

The combination of exact inference with sampling methods within the framework of Rao- 
Blackwellisation (Casella & Robert, 1996) can also result in great improvements. Suppose 
we can divide the hidden variables x into two groups, u and u, such that p(x) = p(v \ u)p(u) 
and, conditional on u, the conditional posterior distribution p(v \ u) is analytically tractable. 
Then we can easily marginalise out v from the posterior, and only need to focus on sampling 
from p(u), which lies in a space of reduced dimension. That is, we sample u <l) ~ p{u) and 
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then use exact inference to compute 

l« (0 ) 

By identifying “troublesome” variables and sampling them, the rest of the problem can 
often be solved easily using exact algorithms such as Kalman filters, HMMs or junction 
trees. For example, one can apply this technique to sample variables that eliminate loops in 
graphical models and then compute the remaining variables with efficient analytical algo- 
rithms (Jensen, Kong, & Kjaerulff, 1995; Wilkinson & Yeung, 2002). Other application areas 
include dynamic Bayesian networks (Doucet et al., 2000), conditionally Gaussian models 
(Carter & Kohn, 1994; De Jong & Shephard, 1995; Doucet, 1998) and model averaging 
for graphical models (Friedman & Roller, this issue). The problem of how to automatically 
identify which variables should be sampled, and which can be handled analytically is still 
open. An interesting development is the augmentation of high dimensional models with 
low dimensional artificial variables. By sampling only the artificial variables, the original 
model decouples into simpler, more tractable submodels (Albert & Chib, 1993; Andrieu, de 
Freitas, & Doucet, 2001b; Wood & Kohn, 1998); see also Holmes and Denison (this issue). 
This strategy allows one to map probabilistic classification problems to simpler regression 
problems. 

The design of efficient sampling methods most of the times hinges on awareness of 
the basic building blocks of MCMC (mixtures of kernels, augmentation strategies and 
blocking) and on careful design of the proposal mechanisms. The latter requires domain 
specific knowledge and heuristics. There are great opportunities for combining existing 
sub-optimal algorithms with MCMC in many machine learning problems. Some areas that 
are already benefiting from sampling methods include: 


1. Computer vision. Tracking (Isard & Blake, 1996; Ormoneit, Lemieux, & Fleet, 2001), 
stereo matching (Dellaert et al., this issue), colour constancy (Forsyth, 1999), restoration 
of old movies (Morris, Fitzgerald, & Kokaram, 1996) and segmentation (Clark & Quinn, 
1999; Kam, 2000; Tu & Zhu, 2001). 

2. Web statistics. Estimating coverage of search engines, proportions belonging to specific 
domains and the average size of web pages (Bar-Yossef et al., 2000). 

3. Speech and audio processing. Signal enhancement (Godsill & Rayner, 1998; Vermaak 
etal., 1999). 

4. Probabilistic graphical models. For example (Gilks, Thomas, & Spiegelhalter, 1994; 
Wilkinson & Yeung, 2002) and several papers in this issue. 

5. Regression and classification. Neural networks and kernel machines (Andrieu, de 
Freitas, & Doucet, 2001a; Holmes & Mallick, 1998; Neal, 1996; Muller & Rios 
Insua, 1998), Gaussian processes (Barber & Williams, 1997), CART (Denison, Mallick, 
& Smith, 1998) and MARS (Holmes & Denison, this issue). 

6. Computer graphics. Light transport (Veach & Guibas, 1997) and sampling plausible 
solutions to multi-body constraint problems (Chenney & Forsyth, 2000). 
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7. Data association. Vehicle matching in highway systems (Pasula et al., 1999) and mul- 
titarget tracking (Bergman, 1999). 

8. Decision theory. Partially observable Markov decision Processes (POMDPs) (Thrun, 
2000; Salmond & Gordon, 2001), abstract Markov policies (Bui, Venkatesh, & West, 
1999) and influence diagrams (Bielza, Muller, & Rios Insua, 1999). 

9. First order probabilistic logic. (Pasula & Russell, 2001). 

10. Genetics and molecular biology. DNA microarray data (West et al., 2001), cancer gene 
mapping (Newton & Lee, 2000), protein alignment (Neuwald et al., 1997) and linkage 
analysis (Jensen, Kong, & Kjasrulff, 1995). 

11. Robotics. Robot localisation and map building (Fox et al., 2001). 

12. Classical mixture models. Mixtures of independent factor analysers (Utsugi, 2001) and 
mixtures of factor analysers (Fokoue & Titterington, this issue). 

We hope that this review will be a useful resource to people wishing to carry out further 
research at the interface between MCMC and machine learning. For conciseness, we have 
skipped many interesting ideas, including tempering and coupling. For more details, we 
advise the readers to consult the references at the end of this paper. 
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